Understand the importance of data exploration before analysis
Apply exploratory functions to explore and summarise datasets
Identify different data types and structures in datasets
Select appropriate visualisation types based on data characteristics
Understand the Grammar of Graphics approach to data visualisation
Create and customise visualisations in R using the ggplot2 package
Build plots layer-by-layer using the ggplot2 framework
Interpret distributions, including skewness, kurtosis, and outliers
Quick checklist
By now you should have…
Core concepts
Data exploration
Why explore data before analysis?
Identify patterns, outliers, and relationships
Detect data quality issues
Guide selection of appropriate statistical methods
Avoid incorrect conclusions from flawed data
The data exploration workflow
Understand data structure and types
Examine distributions and summary statistics
Visualise relationships between variables
Identify patterns and anomalies
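The four workflow steps can be sketched in a few lines of base R. This is only an illustration, assuming the built-in airquality dataset as a stand-in for your own data; the choice of columns and plots is ours, not a fixed recipe.

```r
# A minimal walk through the workflow using the built-in airquality dataset
data(airquality)

# 1. Understand data structure and types
str(airquality)

# 2. Examine distributions and summary statistics
summary(airquality)

# 3. Visualise relationships between variables (pairwise scatterplots)
pairs(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# 4. Identify patterns and anomalies (a boxplot highlights unusual values)
boxplot(airquality$Ozone, main = "Ozone", ylab = "Ozone (ppb)")
```

Each step informs the next: the structure tells you which summaries make sense, and the summaries tell you which plots are worth drawing.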
Types of data: recap from Week 1
Data in R can be broadly categorised as either categorical or continuous.
Different data types require different analysis approaches
Understanding data types helps select appropriate visualisations
R stores different data types in specific formats (which is why we need to know what they are when we import data!)
Categorical data
Nominal: no natural order, e.g.
Species (dog, cat, fish)
Hair colour (black, brown, blonde)
Blood type (A, B, AB, O)
Ordinal: natural order exists, e.g.
Education (primary, secondary, tertiary)
Pain scale (mild, moderate, severe)
T-shirt sizes (S, M, L, XL)
Continuous data
Interval: equal intervals, no true zero, e.g.
Temperature in °C (0°C isn’t “no temperature”)
Calendar dates
pH scale
Ratio: equal intervals with true zero, e.g.
Height (0 cm = no height)
Weight (0 kg = no weight)
Age (0 years = birth)
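As a rough sketch of how these data types are usually stored in R: nominal data as a factor, ordinal data as an ordered factor, and continuous data as numeric. The variable names and values below are made up for illustration.

```r
# Nominal: unordered categories stored as a factor
species <- factor(c("dog", "cat", "fish", "cat"))

# Ordinal: an ordered factor with the level order stated explicitly
pain <- factor(c("mild", "severe", "moderate", "mild"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)

# Continuous: stored as numeric
height_cm <- c(172.5, 160.2, 181.0, 168.4)

class(species)    # "factor"
class(pain)       # "ordered" "factor"
class(height_cm)  # "numeric"

# Order comparisons only make sense for ordered factors
pain < "severe"
```

Stating the levels explicitly matters for ordinal data: without them, R orders the levels alphabetically.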
Different types of data are distributed differently
Understanding how data is distributed is crucial for selecting appropriate ways to explain it to others.
Normal distribution: Introduction
Bell-shaped, symmetric curve
Defined by mean ($\mu$) and standard deviation ($\sigma$)
Many natural phenomena follow this distribution
Heights of individuals in a population
Measurement errors
Many physiological traits
$X \sim N(\mu, \sigma^2)$
The random variable $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$
What does a normal distribution look like?
Code
library(ggplot2)

# Generate normal distribution data
set.seed(123)
normal_data <- rnorm(1000, mean = 0, sd = 1)

# Plot normal distribution
ggplot(data.frame(x = normal_data), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "skyblue",
                 colour = "black") +
  geom_density(colour = "red") +
  labs(title = "Standard Normal Distribution (μ = 0, σ = 1)",
       x = "Value",
       y = "Density")
Properties of normal distribution: The empirical rule
Mean = median = mode
~68% of data within $1\sigma$ of the mean
~95% of data within $2\sigma$ of the mean
~99.7% of data within $3\sigma$ of the mean
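These percentages can be checked directly from the normal cumulative distribution function. A minimal sketch using pnorm() from base R; the helper name within_k_sd is ours, introduced only for this example.

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
round(coverage, 4)
# → 0.6827 0.9545 0.9973
```

Because the rule is stated in standard deviations, the same proportions hold for any mean and standard deviation.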
The empirical rule visualised
Code
# Create a standard normal distribution
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)
df <- data.frame(x = x, y = y)

# Plot with empirical rule highlighted
# Note the space in `x < -1`: written as `x <-1`, R reads it as an assignment
ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  # Add vertical reference lines at standard deviations
  geom_vline(xintercept = c(-3, -2, -1, 0, 1, 2, 3),
             linetype = "dashed", colour = "gray50", alpha = 0.7) +
  geom_area(data = subset(df, x >= -1 & x <= 1),
            fill = "darkblue", alpha = 0.3) +
  geom_area(data = subset(df, (x >= -2 & x < -1) | (x > 1 & x <= 2)),
            fill = "darkgreen", alpha = 0.3) +
  geom_area(data = subset(df, (x >= -3 & x < -2) | (x > 2 & x <= 3)),
            fill = "darkred", alpha = 0.3) +
  annotate("text", x = 0, y = 0.2, label = "68%", colour = "darkblue") +
  annotate("text", x = 1.5, y = 0.1, label = "95%", colour = "darkgreen") +
  annotate("text", x = 2.5, y = 0.05, label = "99.7%", colour = "darkred") +
  labs(title = "Normal Distribution: Empirical Rule",
       x = "Standard Deviations from Mean",
       y = "Density")
Why normal distributions matter in data exploration
When exploring data, understanding distributions helps you:
Identify patterns and anomalies
Is your data normally distributed as expected?
Are there unexpected skews or outliers?
Choose appropriate analysis methods
Many statistical tests assume normality
Non-normal data may require different approaches
Interpret results correctly
Context for understanding how unusual a value is
Framework for making statistical inferences
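One common way to ask "is my data normally distributed?" is a Q-Q plot, optionally backed by a Shapiro-Wilk test. The sketch below uses simulated data (the sample names and parameters are assumptions for illustration), and the test is one option among several rather than a required step.

```r
set.seed(42)
sample_normal <- rnorm(200, mean = 25, sd = 3)  # simulated, roughly normal
sample_skewed <- rexp(200, rate = 1 / 25)       # simulated, right-skewed

# Q-Q plots: points close to the line suggest approximate normality
par(mfrow = c(1, 2))
qqnorm(sample_normal, main = "Simulated normal data")
qqline(sample_normal)
qqnorm(sample_skewed, main = "Simulated skewed data")
qqline(sample_skewed)
par(mfrow = c(1, 1))

# Shapiro-Wilk test: a small p-value is evidence against normality
p_normal <- shapiro.test(sample_normal)$p.value
p_skewed <- shapiro.test(sample_skewed)$p.value
c(normal = p_normal, skewed = p_skewed)
```

With large samples the test flags even trivial departures from normality, so the plot is usually the more informative of the two.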
Example
Many biological traits follow normal distributions. For example, plant heights within a species:
Code
# Simulate plant height data
set.seed(456)
plant_heights <- rnorm(200, mean = 25, sd = 3)  # Heights in cm

# Plot the distribution
ggplot(data.frame(height = plant_heights), aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 20,
                 fill = "#66c2a5",
                 colour = "black") +
  geom_density(colour = "#1f78b4", linewidth = 1) +
  geom_vline(xintercept = 25, linetype = "dashed", colour = "red") +
  annotate("text", x = 25.5, y = 0.05, label = "μ = 25 cm", colour = "red") +
  labs(title = "Distribution of Plant Heights in a Population",
       subtitle = "Example of a biological trait following normal distribution",
       x = "Height (cm)",
       y = "Density")
This example shows how plant heights cluster around the mean (25 cm) following a normal distribution pattern. This helps researchers identify outliers, establish experimental categories, and detect environmental effects on growth patterns.
Skewness
What is skewness?
Measure of asymmetry in a distribution
Indicates which side of the distribution has a longer tail
Important for selecting appropriate statistical tests
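Skewness can be computed by hand as the average cubed deviation from the mean, scaled by the cubed standard deviation. The function below is a minimal sketch of this moment-based definition; other definitions with small-sample corrections exist, and the function name is ours rather than a base R function.

```r
# Moment-based sample skewness: mean cubed deviation / (mean squared deviation)^1.5
skewness <- function(x) {
  x <- x[!is.na(x)]               # drop missing values
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

skewness(c(1, 2, 3, 4, 5))   # symmetric values: skewness is 0
skewness(airquality$Ozone)   # positive value: longer right tail
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero an approximately symmetric distribution.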
  species height weight
1       A   1.65     60
2       B   1.70     65
3       C   1.75     70
4       A   1.80     75
5       B   1.85     80
Other data structures include lists, matrices, arrays, and factors, but these are less common at your level.
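The code that created the example data frame is not shown, so the sketch below reconstructs it from the printed output. The values are copied from that output; the units of height and weight are not stated in the source.

```r
# Reconstructed from the printed output; the original creation code is not shown
df <- data.frame(
  species = c("A", "B", "C", "A", "B"),
  height  = c(1.65, 1.70, 1.75, 1.80, 1.85),  # units not stated in the source
  weight  = c(60, 65, 70, 75, 80)
)

df
```

Running this first makes the exploration functions that follow reproducible on your own machine.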
Common functions
Use these essential functions to understand your data structure and summary statistics:
# Core function 1: Structure overview
str(df)
'data.frame': 5 obs. of 3 variables:
$ species: chr "A" "B" "C" "A" ...
$ height : num 1.65 1.7 1.75 1.8 1.85
$ weight : num 60 65 70 75 80
# Core function 2: Statistical summary
summary(df)
   species              height         weight
 Length:5           Min.   :1.65   Min.   :60
 Class :character   1st Qu.:1.70   1st Qu.:65
 Mode  :character   Median :1.75   Median :70
                    Mean   :1.75   Mean   :70
                    3rd Qu.:1.80   3rd Qu.:75
                    Max.   :1.85   Max.   :80
The summary() function provides a quick overview of your data and can help identify skewness, outliers, and missing values…but it isn’t always enough.
Your options are endless (almost)
There are many specialised functions for exploring different aspects of your data:
# Check for unique values in categorical variables
unique(df$species)
[1] "A" "B" "C"
# Visualise missing data patterns
library(naniar)
vis_miss(airquality)
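If naniar is not installed, base R gives a numeric view of the same missingness. A minimal sketch on the built-in airquality data; the object names are ours.

```r
# Percentage of missing values in each column
pct_missing <- round(colMeans(is.na(airquality)) * 100, 1)
pct_missing

# Number of rows with no missing values in any column
n_complete <- sum(complete.cases(airquality))
n_complete
```

This gives the amount of missing data per column, but not where in the dataset it occurs.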
The value of data visualisation
The output of vis_miss() clearly demonstrates the advantage of a visual approach to data exploration.
Compare the visualisation to looking at the raw data or a summary of the raw data:
airquality$Ozone
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
summary(airquality$Ozone)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00   18.00   31.50   42.13   63.25  168.00      37
Common plot types and their applications
Different types of data require different visualisation approaches. Let’s explore the most common plot types and when to use them.
Histograms
Purpose and applications:
Visualise distribution of continuous data
Identify central tendency, spread, outliers, and skewness
Examine distributions of measurements in biological data
When to use:
For continuous variables (interval or ratio data)
When you want to understand the shape of a distribution
Examples: heights, weights, temperatures, measurements
Code
# Example using base R with palmerpenguins data
library(palmerpenguins)

hist(penguins$body_mass_g,
     main = "Distribution of Penguin Body Mass",
     xlab = "Body Mass (g)",
     col = "skyblue",
     border = "white")
Bar plots
Purpose and applications:
Compare values across categories
Show proportions or counts in categorical data
Visualise species abundance, treatment effects
When to use:
For categorical variables (nominal or ordinal data)
When comparing frequencies or counts across groups
Examples: species counts, treatment groups, survey responses
Code
# Example using base R with palmerpenguins data
species_counts <- table(penguins$species)

barplot(species_counts,
        main = "Count of Penguins by Species",
        xlab = "Species",
        ylab = "Count",
        col = c("darkorange", "purple", "cyan4"),
        border = "white")
Scatterplots
Purpose and applications:
Examine relationships between continuous variables
Identify correlations, patterns, and outliers
Explore relationships between measurements
When to use:
When examining relationships between two continuous variables
When looking for correlations or patterns
Examples: height vs. weight, temperature vs. growth rate
Code
# Example using base R with palmerpenguins data
# Remove NA values for this example
penguins_clean <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g")])

plot(penguins_clean$flipper_length_mm, penguins_clean$body_mass_g,
     main = "Relationship Between Flipper Length and Body Mass",
     xlab = "Flipper Length (mm)",
     ylab = "Body Mass (g)",
     pch = 19,
     col = "darkblue")
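A scatterplot shows the shape of a relationship; a correlation coefficient summarises its strength in a single number. The sketch below uses the built-in airquality data so that it runs without extra packages, and the choice of Temp and Ozone is ours for illustration.

```r
# Pearson correlation, using only rows where both values are present
r_pearson <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")

# Spearman (rank-based) correlation is less sensitive to outliers and curvature
r_spearman <- cor(airquality$Temp, airquality$Ozone,
                  use = "complete.obs", method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Always look at the plot as well: very different patterns can produce the same correlation coefficient.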
Boxplots
Purpose and applications:
Compare distributions across groups
Visualise median, quartiles, and outliers
Compare measurements across treatments
When to use:
When comparing a continuous variable across categorical groups
When you need to show the spread and central tendency
Examples: comparing heights across species, measurements across treatments
Code
# Example using base R with palmerpenguins data
boxplot(body_mass_g ~ species, data = penguins,
        main = "Body Mass by Penguin Species",
        xlab = "Species",
        ylab = "Body Mass (g)",
        col = c("darkorange", "purple", "cyan4"),
        border = "black")
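By default, boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The sketch below shows that rule on the built-in airquality Ozone values, chosen here so the example needs no extra packages; boxplot.stats() is the base R helper that boxplot() relies on.

```r
ozone <- airquality$Ozone

# boxplot.stats() returns the box and whisker positions and the flagged points
stats <- boxplot.stats(ozone)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values drawn as individual outlier points

# The same idea by hand, using quantile-based quartiles
q3 <- quantile(ozone, 0.75, na.rm = TRUE)[[1]]
fence_upper <- q3 + 1.5 * IQR(ozone, na.rm = TRUE)
manual_out <- ozone[!is.na(ozone) & ozone > fence_upper]
manual_out
```

A flagged point is not necessarily an error. It is a value worth checking against how the data were collected.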
Introduction to ggplot2
The Grammar of Graphics
ggplot2 is based on the Grammar of Graphics, a systematic approach to creating visualisations by combining different components:
Data: The dataset you want to visualise
Aesthetics: Mapping variables to visual properties (position, colour, size, etc.)
Geometries: The shapes used to represent the data (points, lines, bars, etc.)
Scales: How values are mapped to visual properties
Facets: How to split the data into subplots
Coordinates: The coordinate system to use
Themes: Visual styling of the plot
This grammar allows you to build complex visualisations layer by layer.
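To make the seven components concrete, the sketch below labels each one in a single ggplot2 call on the palmerpenguins data. The particular palette, faceting variable, and theme are illustrative choices on our part, not requirements of the grammar.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(
  data = na.omit(penguins),                    # Data
  aes(x = flipper_length_mm,                   # Aesthetics: position ...
      y = body_mass_g,
      colour = species)                        # ... and colour
) +
  geom_point(alpha = 0.7) +                    # Geometries
  scale_colour_brewer(palette = "Dark2") +     # Scales
  facet_wrap(~ island) +                       # Facets
  coord_cartesian() +                          # Coordinates
  theme_minimal()                              # Themes

p
```

Only data, aesthetics, and a geometry are needed to draw a plot; the other components fall back to sensible defaults when omitted.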
Why use ggplot2?
Consistent syntax across different plot types
Layered approach makes it easy to build complex visualisations
Excellent defaults that produce publication-quality graphics
Highly customisable with extensive options for fine-tuning
Large community with extensive documentation and examples
Building a plot: Step 1 - Start with data
Let’s build a scatterplot of penguin flipper length vs. body mass using the palmerpenguins dataset.
First, we need to load the ggplot2 package and prepare our data:
Code
library(ggplot2)
library(palmerpenguins)

# Remove missing values for this example
penguins_clean <- na.omit(penguins)

# Look at the first few rows of our data
head(penguins_clean[, c("species", "flipper_length_mm", "body_mass_g")])